feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring by behinddwalls · Pull Request #177 · uber/submitqueue

behinddwalls · 2026-06-02T18:09:44Z

Summary

Why?

The orchestrator's build stage needs to drive the BuildRunner contract
end-to-end: trigger the runner, persist the result, and poll Status
until terminal so the batch state machine can react. Polling has to
behave like the rest of the pipeline (queue-driven, partition-isolated,
restart-safe) rather than running as an in-process timer loop.

What?

Stacks on top of the BuildRunner interface, noop, and PublishAfter
PRs. Wires the runner into the orchestrator pipeline.

The build poll loop runs as queue traffic inside the existing
buildsignal consumer (no separate stage). On each delivery it loads the
Build from storage, calls BuildRunner.Status, persists the result via
BuildStore.UpdateStatus, publishes the batch ID to speculate so the
state machine re-evaluates, and re-publishes itself via
Publisher.PublishAfter until the build reaches a terminal state. A
webhook-capable backend can publish into the same topic — the consumer
cannot tell a poll-driven message from a push.

Only the build ID travels on the queue (entity.BuildID); the
consumer reloads the full Build from BuildStore, keeping the message
small and storage the single source of truth — the same ID-on-the-queue,
load-from-storage pattern the rest of the pipeline already uses for
batches and requests. The controllers consume the runner's
entity.BuildID signatures (Trigger returns one; Status takes one).

Pieces:

orchestrator/controller/build: assembles base from
batch.Dependencies and head from batch.Contains, calls
Trigger, persists the initial Build{Accepted} via
BuildStore.Create (ErrAlreadyExists is swallowed for redelivery),
publishes the build ID to buildsignal.
orchestrator/controller/buildsignal: the polling consumer described
above. It loads the Build by ID, then polls. PollDelayAcceptedMs=5000,
PollDelayRunningMs=2000 by default (vars so tests can override; a TODO
notes these should move into the queueconfig extension). Error
classification: only the PublishAfter re-schedule is wrapped retryable
(errs.NewRetryableError) — it is the poll loop's heartbeat, so a
transient enqueue blip nacks and replays (up to MaxAttempts) rather
than rejecting the loop's only live message straight to DLQ. Deserialize,
the Build load, Status, UpdateStatus, and the speculate publish
stay non-retryable and reject to DLQ on first failure, where an
operational republish is the recovery path.
example/server/orchestrator/main.go: passes the BuildRunner to
both build.NewController and buildsignal.NewController; pipeline
diagram updated.
root BUILD.bazel: adds # gazelle:exclude .claude so gazelle does
not index nested worktrees as duplicate rule definitions and corrupt
the canonical BUILD files.

Test Plan

✅ bazel test //extension/buildrunner/... //orchestrator/controller/build/... //orchestrator/controller/buildsignal/... //extension/queue/... — all pass.
✅ make fmt lint check-tidy check-gazelle check-mocks — clean.
✅ make build — all targets compile.
New coverage: build controller persist+publish path (with
ErrAlreadyExists swallow), buildsignal poll loop (terminal forwards
to speculate, non-terminal re-publishes via PublishAfter with
per-status delay, retryable re-publish failure, non-retryable build-load
/ Status / UpdateStatus failures reject to DLQ).

) ## Summary ### Why? The build stage needs a vendor-agnostic abstraction for talking to a Build Runner — one that fits the existing extension family (storage, queue, conflict) and that does not have to change when the first real backend lands. Design rationale lives in `doc/rfc/build-runner.md`. ### What? Introduces the `BuildRunner` contract — three verbs, all keyed by an `entity.BuildID`: `Trigger` submits a build (ordered `base` + `head` change sets plus a free-form metadata map) and returns its ID; `Status` fetches the current `BuildStatus` and runner-defined metadata; `Cancel` requests cancellation. See `extension/buildrunner/build_runner.go` for the signatures. Highlights: - `Trigger` takes ordered `base` (assumed-good prefix from dependency batches) and `head` (changes being verified) — a runner can cache or short-circuit a prefix it has validated before, and attribute terminal failures to base vs head via `BuildMetadata`. - Build identifiers are the typed `entity.BuildID` (a lightweight `{ID string}` value, also the queue payload envelope) rather than a bare `string`, so the runner contract and the on-queue messages share one identifier type and the compiler distinguishes a build ID from other string IDs. - `metadata` is a free-form `map[string]string` for caller-supplied attributes (requester, ticket ID, trace ID) the runner MAY persist or echo back via `Status`. - `Trigger` and `Cancel` return promptly; `Status` MAY be lengthy. - The `BuildStatus` enum is narrowed to `Accepted` / `Running` / `Succeeded` / `Failed` / `Cancelled` so every backend can map its native lifecycle without leaking runner-specific stages. - `BuildMetadata` is added as a free-form map for round-tripping runner-defined attributes (build URL, duration, SHA, etc.). Naming: the interface is `BuildRunner` (not `BuildManager` — the "Manager" suffix doesn't describe what it does, and "Build" alone would collide cognitively with `entity.Build`). The package is `extension/buildrunner` to follow the repo convention used by `mergechecker`, `changeprovider`, `pusher`, `counter`, and `storage`. Implementation (noop backend, queue `PublishAfter` primitive, controller wiring, and the poll-driven `buildsignal` loop) lands in the follow-up PRs stacked on top of this one. ## Test Plan - ✅ `make test` — entity/build tests pass; mock-only consumers of the interface compile and pass. - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile. ## Stack 1. @ #175 1. #179 1. #176 1. #177

## Summary ### Why? Wiring tests for the orchestrator's build stage need a `BuildRunner` implementation, but real backends (BuildKite, Jenkins, etc.) won't land until later. A noop that immediately reports success is also useful as the local default in `make local-start` and as a best-case baseline where every build passes. ### What? Adds `extension/buildrunner/noop` — a `BuildRunner` that does no real work. Every `Trigger` returns a unique `noop-<n>` build ID with no error; every `Status` returns `BuildStatusSucceeded` with no metadata; `Cancel` is a no-op. The atomic counter keeps IDs unique under concurrent use. ## Test Plan - ✅ `make test` — 36 unit tests pass, including the new `extension/buildrunner/noop:noop_test` (Trigger uniqueness, Status, Cancel, interface satisfaction). - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile. ## Issues ## Stack 1. #175 1. @ #179 1. #176 1. #177

## Summary ### Why? The orchestrator's build poll loop needs to space `Status` calls out without consuming `retry_count` / DLQ retry slots. `Delivery.Nack` does schedule redelivery after a delay, but it overloads `retry_count` — which the operator relies on to flag real failures. The fix is a separate "postpone this work" primitive, distinct from "this delivery failed, try again", so both signals stay meaningful. ### What? Adds `Publisher.PublishAfter(ctx, topic, msg, delayMs)` to the queue extension: a fresh message inserted into the topic, made visible to subscribers only after `delayMs` from now. Distinct from `Delivery.Nack(requeueAfterMillis)`: - `Nack` is "this delivery failed, try again" — it bumps `retry_count` and eventually trips DLQ. - `PublishAfter` is "postpone this work" — `retry_count` resets to 0, DLQ stays available for true failures. SQL backing in `extension/queue/mysql`: - New `visible_after BIGINT UNSIGNED NOT NULL DEFAULT 0` column on `queue_messages`. Default 0 means immediately visible — back-compat for any existing rows. - `messageStore.InsertDelayed(ctx, topic, messages, visibleAfterMs)` is the underlying primitive. `Insert` is now a thin wrapper that passes `visibleAfterMs = 0`. - `FetchByOffset` gains a `nowMs` parameter and an `AND visible_after <= ?` predicate so subscribers skip rows whose delivery is still deferred. - `MoveToDLQ` writes `visible_after = 0` explicitly: any delay on the original message has already been consumed by the time it failed. The first consumer of `PublishAfter` is the orchestrator's poll-driven `buildsignal` loop, which lands in the stacked PR on top of this one. ## Test Plan - ✅ `bazel test //extension/queue/...` — all pass, including new coverage for `PublishAfter` (positive / zero / negative delay, closed-publisher) and for `FetchByOffset` skipping deferred rows via the new `visible_after` predicate. - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile. ## Issues ## Stack 1. #175 1. #179 1. @ #176 1. #177

## Summary ### Why? The orchestrator's build stage needs to drive the `BuildRunner` contract end-to-end: trigger the runner, persist the result, and poll `Status` until terminal so the batch state machine can react. Polling has to behave like the rest of the pipeline (queue-driven, partition-isolated, restart-safe) rather than running as an in-process timer loop. ### What? Stacks on top of the BuildRunner interface, noop, and `PublishAfter` PRs. Wires the runner into the orchestrator pipeline. The build poll loop runs as queue traffic inside the existing `buildsignal` consumer (no separate stage). On each delivery it loads the `Build` from storage, calls `BuildRunner.Status`, persists the result via `BuildStore.UpdateStatus`, publishes the batch ID to `speculate` so the state machine re-evaluates, and re-publishes itself via `Publisher.PublishAfter` until the build reaches a terminal state. A webhook-capable backend can publish into the same topic — the consumer cannot tell a poll-driven message from a push. Only the build **ID** travels on the queue (`entity.BuildID`); the consumer reloads the full `Build` from `BuildStore`, keeping the message small and storage the single source of truth — the same ID-on-the-queue, load-from-storage pattern the rest of the pipeline already uses for batches and requests. The controllers consume the runner's `entity.BuildID` signatures (`Trigger` returns one; `Status` takes one). Pieces: - `orchestrator/controller/build`: assembles `base` from `batch.Dependencies` and `head` from `batch.Contains`, calls `Trigger`, persists the initial `Build{Accepted}` via `BuildStore.Create` (`ErrAlreadyExists` is swallowed for redelivery), publishes the build ID to `buildsignal`. - `orchestrator/controller/buildsignal`: the polling consumer described above. It loads the `Build` by ID, then polls. `PollDelayAcceptedMs=5000`, `PollDelayRunningMs=2000` by default (vars so tests can override; a TODO notes these should move into the `queueconfig` extension). Error classification: only the `PublishAfter` re-schedule is wrapped retryable (`errs.NewRetryableError`) — it is the poll loop's heartbeat, so a transient enqueue blip nacks and replays (up to `MaxAttempts`) rather than rejecting the loop's only live message straight to DLQ. Deserialize, the `Build` load, `Status`, `UpdateStatus`, and the speculate publish stay non-retryable and reject to DLQ on first failure, where an operational republish is the recovery path. - `example/server/orchestrator/main.go`: passes the `BuildRunner` to both `build.NewController` and `buildsignal.NewController`; pipeline diagram updated. - root `BUILD.bazel`: adds `# gazelle:exclude .claude` so gazelle does not index nested worktrees as duplicate rule definitions and corrupt the canonical BUILD files. ## Test Plan - ✅ `bazel test //extension/buildrunner/... //orchestrator/controller/build/... //orchestrator/controller/buildsignal/... //extension/queue/...` — all pass. - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile. - New coverage: build controller persist+publish path (with `ErrAlreadyExists` swallow), buildsignal poll loop (terminal forwards to speculate, non-terminal re-publishes via `PublishAfter` with per-status delay, retryable re-publish failure, non-retryable build-load / `Status` / `UpdateStatus` failures reject to DLQ).

This was referenced Jun 2, 2026

feat(queue): add Publisher.PublishAfter delay primitive #176

Merged

feat(buildrunner): add vendor-agnostic BuildRunner interface + RFC #175

Merged

behinddwalls force-pushed the preetam/build-ext branch from ab94b8a to c5e5759 Compare June 2, 2026 18:28

behinddwalls force-pushed the preetam/queue-publish-after branch from 42a34ea to 2dd6d25 Compare June 2, 2026 18:28

behinddwalls changed the title ~~feat(buildrunner): noop impl, PublishAfter queue primitive, poll-driven buildsignal~~ feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring Jun 2, 2026

behinddwalls force-pushed the preetam/queue-publish-after branch from 2dd6d25 to a7d0c2a Compare June 2, 2026 18:47

behinddwalls force-pushed the preetam/build-ext branch from c5e5759 to 70ddd63 Compare June 2, 2026 18:47

behinddwalls mentioned this pull request Jun 2, 2026

feat(buildrunner/noop): add pass-through BuildRunner stub #179

Merged

behinddwalls force-pushed the preetam/queue-publish-after branch from a7d0c2a to d83c3a5 Compare June 2, 2026 18:57

behinddwalls force-pushed the preetam/build-ext branch from 70ddd63 to 55b194d Compare June 2, 2026 18:57

behinddwalls force-pushed the preetam/queue-publish-after branch from d83c3a5 to d735e8f Compare June 2, 2026 22:04

behinddwalls force-pushed the preetam/build-ext branch from 55b194d to c0fd951 Compare June 2, 2026 22:04

behinddwalls force-pushed the preetam/queue-publish-after branch from d735e8f to c368626 Compare June 2, 2026 22:29

behinddwalls force-pushed the preetam/build-ext branch 2 times, most recently from e298f94 to de8932c Compare June 2, 2026 22:53

behinddwalls force-pushed the preetam/queue-publish-after branch from c368626 to 64d1a6e Compare June 2, 2026 22:53

behinddwalls force-pushed the preetam/build-ext branch from de8932c to 004ca8a Compare June 2, 2026 23:08

behinddwalls force-pushed the preetam/queue-publish-after branch from 64d1a6e to 1311592 Compare June 2, 2026 23:08

behinddwalls marked this pull request as ready for review June 2, 2026 23:11

behinddwalls requested review from a team and sbalabanov as code owners June 2, 2026 23:11

Base automatically changed from preetam/queue-publish-after to main June 3, 2026 02:01

JamyDev approved these changes Jun 3, 2026

View reviewed changes

behinddwalls force-pushed the preetam/build-ext branch from 004ca8a to 48890aa Compare June 3, 2026 03:15

behinddwalls merged commit 1106427 into main Jun 3, 2026
14 checks passed

behinddwalls deleted the preetam/build-ext branch June 3, 2026 03:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring#177

feat(buildrunner): noop impl, poll-driven buildsignal, pipeline wiring#177
behinddwalls merged 1 commit into
mainfrom
preetam/build-ext

behinddwalls commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

behinddwalls commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why?

What?

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

behinddwalls commented Jun 2, 2026 •

edited

Loading